March Madness Machine Learning Mania
2nd Place Kaggle Solution
Reference to any specific commercial product, process, or service, or the use of any trade, firm, or corporation name is for the information and convenience of the public and does not constitute endorsement, recommendation, or favoring by the Department of Defense or the United States Government.
Introduction
Who are we?
MAJ Dusty Turner
- Education
- United States Military Academy 07, Bachelor of Science in Operations Research
- Missouri University of Science and Technology, Master of Science in Engineering Management
- THE Ohio State University, Master of Science in Industrial and Systems Engineering
- THE Ohio State University, Graduate Minor in Applied Statistics
- Work
- Schofield Barracks, Hawaii, Engineer Platoon Leader / Executive Officer / Assistant S3 (Operation Iraqi Freedom)
- White Sands Missile Range, New Mexico, Engineer Assistant S3 / Commander (Operation Enduring Freedom)
- West Point, New York, Assistant Professor
- Fort Belvoir, Virginia, Operations Research Systems Analyst / Data Scientist
MAJ Jim Pleuss
- Education
- United States Military Academy 07, Bachelor of Science in Computer Science
- Kansas State University, Master of Science in Operations Research
- Work
- Okinawa, Japan, Signal Battalion Platoon Leader / Executive Officer / Assistant S3
- Fort Riley, Kansas, Battalion S6 / Commander (Operation Enduring Freedom)
- West Point, New York, Assistant Professor / Assistant Dean for Plans, Analysis, and Personnel
Both of us
- Starting PhD program in Fall 2022 with follow-on to Math Department at USMA
- Watch too many sports
- Enjoy R
- Kaggle history
Who are you?
Hopefully you are…
- Someone really interested in basketball
- Someone who really enjoys machine learning
What is Kaggle?
Kaggle is pronounced with a short ‘a’ and rhymes with ‘gaggle’, ‘haggle’, etc.
Kaggle hosts machine learning competitions spanning image classification, text analysis, forecasting accuracy, and games. Monetary prizes are awarded for some competitions. Kagglers are ranked for their contributions in competitions, data sets, notebooks, and discussion participation.
Companies sponsor competitions to recruit talent and to have their problems solved relatively cheaply by a large community of data scientists.
What is Kaggle’s March Madness Machine Learning Mania Competition?
Each year, Kaggle hosts a National Collegiate Athletic Association (NCAA) men's basketball tournament competition. Unlike a traditional bracket, entrants submit a probability of victory for one team over the other for every possible pairing of teams that might meet in the tournament. This guarantees you have made a pick for every game.
Teams are scored in the following way:
\[-\frac{1}{n}\sum_{i=1}^{n} [ y_i \ln{(\hat{y}_i)} + (1-y_i) \ln{(1-\hat{y}_i)}]\]
where
- \(n\) is the number of games played
- \(\hat{y}_i\) is the predicted probability of team 1 beating team 2
- \(y_i\) is 1 if team 1 wins, 0 if team 2 wins
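The scoring rule above (the mean log loss over played games) is straightforward to compute directly. A minimal sketch in R, using made-up predictions and outcomes rather than any real submission:

```r
library(tibble)
library(dplyr)

# Hypothetical data: predicted P(team 1 wins) and the actual outcome
results <- tibble(
  y_hat = c(0.85, 0.40, 0.65),  # predicted probabilities
  y     = c(1,    0,    0)      # 1 if team 1 won, 0 otherwise
)

results %>%
  summarise(log_loss = -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat)))
# log_loss ≈ 0.574
```

Confident picks that turn out wrong (e.g., 0.85 on a loss) are punished far more heavily than hedged ones, which is why clipping extreme probabilities matters later in the pipeline.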
Example Submission
# A tibble: 6 x 2
ID Pred
<chr> <dbl>
1 2021_1101_1104 0.181
2 2021_1101_1111 0.459
3 2021_1101_1116 0.0883
4 2021_1101_1124 0.213
5 2021_1101_1140 0.283
6 2021_1101_1155 0.322
Visual of Scoring
Basketball Philosophy
- Focus on how teams are playing at the end of the season. Games in November don’t mean a lot in March.
- March Madness is all about match-ups. Give the model the opportunity to find favorable match-ups by providing data specifying “types” of teams (e.g., slow paced teams, run and gun, highly efficient, 3-point shooting teams).
- Take advantage of the wisdom of crowds.
Data
Kaggle provides a lot of game- and team-level data, but it is very raw and granular.
Ranking Data
Pull the end-of-season team rankings from roughly 40 different ranking systems for each team. We averaged these rankings to capture the "wisdom of the crowd."
seeds <-
read_csv(here("01_data/MNCAATourneySeeds.csv")) %>%
mutate(Seed = as.numeric(str_sub(Seed,2,3)))
all_team_ids <- read_csv(here("01_data/MMasseyOrdinals.csv")) %>%
select(TeamID) %>% distinct() %>% pull() %>% sort()
team_df <-
read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
select(Season, WTeamID, LTeamID) %>%
pivot_longer(cols = contains("Team"), values_to = "TeamID") %>%
distinct(Season,TeamID) %>%
arrange(Season,TeamID)
r_of_interest <- read_csv(here("01_data/MMasseyOrdinals.csv")) %>% filter(SystemName %in% c("SAG","POM","MOR","WLK", "RPI"))
rankings <-
expand_grid(Season = 2003:2021, RankingDayNum = 1:154, TeamID = all_team_ids, SystemName = unique(r_of_interest$SystemName)) %>%
left_join(
r_of_interest
) %>%
arrange(Season,RankingDayNum) %>%
group_by(Season,TeamID, SystemName) %>%
fill(SystemName, OrdinalRank, .direction = "down") %>%
fill(SystemName, OrdinalRank, .direction = "up") %>%
ungroup() %>%
filter(!is.na(OrdinalRank)) %>%
pivot_wider(names_from = SystemName, values_from = OrdinalRank, values_fn = mean) %>%
group_by(Season, TeamID) %>%
slice_max(RankingDayNum, n = 1) %>%
ungroup()
head(rankings)
# A tibble: 6 x 8
Season RankingDayNum TeamID MOR POM RPI SAG WLK
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2003 154 1102 132 160 158 149 165
2 2003 154 1103 139 163 182 172 172
3 2003 154 1104 26 33 38 37 36
4 2003 154 1105 309 307 313 312 310
5 2003 154 1106 294 263 248 268 254
6 2003 154 1107 316 312 294 307 309
End of Season Summary Statistics
Summarize all team statistics at the end of the regular season.
summary_stats <-
read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
mutate(win = if_else(AScore > BScore, 1, 0), win_by = AScore - BScore) %>%
select(Season, DayNum, TeamID = ATeamID, AScore, Loc = ALoc, AFGM:APF, win, win_by) %>%
pivot_longer(cols = -c(Season,DayNum,TeamID,Loc)) %>%
bind_rows(
read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
mutate(win = 0, win_by = BScore - AScore) %>%
select(Season, DayNum, TeamID = BTeamID, BScore, Loc = ALoc, BFGM:BPF, win, win_by) %>%
mutate(Loc = case_when(Loc == "A" ~ "H",
Loc == "H" ~ "A",
TRUE ~ Loc)) %>%
pivot_longer(cols = -c(Season,DayNum,TeamID, Loc))
) %>%
mutate(name = str_sub(name, 2,-1)) %>%
group_by(Season, TeamID, name) %>%
summarise(summary_val = mean(value, na.rm = T)) %>%
pivot_wider(names_from = name, values_from = summary_val) %>%
ungroup() %>%
rename(avg_win = `in`, avg_win_by = `in_by`) %>%
relocate(c(avg_win,avg_win_by), .after = last_col())
summary_stats_final <-
read_csv(here("01_data/MNCAATourneyDetailedResults.csv")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
mutate(win = if_else(AScore > BScore, 1, 0), win_by = AScore - BScore) %>%
select(Season, DayNum, TeamID = ATeamID, AScore, AFGM:APF, win, win_by) %>%
pivot_longer(cols = -c(Season,DayNum,TeamID)) %>%
bind_rows(
read_csv(here("01_data/MNCAATourneyDetailedResults.csv")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
mutate(win = 0, win_by = BScore - AScore) %>%
select(Season, DayNum, TeamID = BTeamID, BScore, BFGM:BPF, win, win_by) %>%
pivot_longer(cols = -c(Season,DayNum,TeamID))
) %>%
mutate(name = str_sub(name, 2,-1)) %>%
group_by(Season, TeamID, name) %>%
summarise(summary_val = mean(value, na.rm = T)) %>%
pivot_wider(names_from = name, values_from = summary_val) %>%
ungroup() %>%
rename(avg_win = `in`, avg_win_by = `in_by`) %>%
relocate(c(avg_win,avg_win_by), .after = last_col())
head(summary_stats_final)
# A tibble: 6 x 18
Season TeamID Ast Blk DR FGA FGA3 FGM FGM3 FTA FTM OR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2003 1104 13 6 20 52 12 22 5 16 13 9
2 2003 1112 18.8 4.5 30.2 67.8 20.8 31 7.75 19.2 15 13.5
3 2003 1113 14.5 4.5 20.5 64 14.5 29.5 5 22.5 16 13.5
4 2003 1120 12.3 5.33 23.7 58.7 19.7 25.3 7 16.7 12.7 13.3
5 2003 1122 11 3 21 54 21 24 2 22 14 8
6 2003 1139 12 1.33 17 47.3 17.7 21.3 7.67 14.7 9.67 8
# ... with 6 more variables: PF <dbl>, Score <dbl>, Stl <dbl>, TO <dbl>,
# avg_win <dbl>, avg_win_by <dbl>
External Efficiency Data
These data came from the team rankings website. They are a snapshot of several advanced end-of-season statistics:
- Offensive/Defensive efficiency
- Possessions per game
- Free throw rate
- Strength of schedule
- Percentage of points from 3-pointers
Hierarchical Clustering
We then used hierarchical clustering to group every Division I team from the past 15 years into 10 clusters using many of the metrics above.
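The clustering itself was done offline (the `clust` column arrives pre-computed in the CSV below), but the approach can be sketched with base R's `hclust`. This is an illustration, not our exact code; `team_metrics` is a hypothetical data frame with one row per team-season and numeric metric columns:

```r
library(dplyr)

# Sketch: hierarchical clustering of teams on scaled metrics.
# `team_metrics` is assumed, not from the actual pipeline.
clusters <-
  team_metrics %>%
  select(where(is.numeric), -TeamID, -Season) %>%
  scale() %>%                      # standardize so no metric dominates
  dist() %>%                       # Euclidean distances between teams
  hclust(method = "ward.D2") %>%   # Ward linkage builds compact clusters
  cutree(k = 10)                   # cut the dendrogram into 10 groups

team_metrics <- team_metrics %>% mutate(clust = clusters)
```

Ward linkage with standardized inputs tends to produce balanced, interpretable "team type" groups (slow-paced, run-and-gun, three-point heavy, and so on).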
ranking_data <-
read_csv(here("01_data/ranking_data_2.csv"), guess_max = 10000) %>%
relocate(TeamID) %>%
janitor::clean_names() %>%
select(TeamID = team_id, Season = season, rank_avg, OE = overalloffensive_efficiency, DE = overalldefensive_efficiency,clust, last3offensive_efficiency, last3possessions_per_game,
last3defensive_efficiency, overallfree_throw_rate, col, ratingschedule_strength_by_other,overallpercent_of_points_from_3_pointers,last3_change,t3_week_rank_avg)
head(ranking_data)
# A tibble: 6 x 15
TeamID Season rank_avg OE DE clust last3offensive_ef~ last3possessions_~
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1266 2003 12.8 1.19 1.01 1 1.12 73.1
2 1196 2003 9.82 1.13 0.96 1 1.08 68.4
3 1166 2003 20.7 1.12 0.916 2 1.03 68.9
4 1139 2003 41.1 1.12 0.984 3 1.06 60.6
5 1462 2003 15.7 1.11 0.927 2 1.21 70
6 1458 2003 19.8 1.11 0.909 1 1.13 59.7
# ... with 7 more variables: last3defensive_efficiency <dbl>,
# overallfree_throw_rate <dbl>, col <dbl>,
# ratingschedule_strength_by_other <dbl>,
# overallpercent_of_points_from_3_pointers <dbl>, last3_change <dbl>,
# t3_week_rank_avg <dbl>
Quad 1 Wins and Quad 4 Losses
This gives the number of wins against top teams and losses against bottom teams, based on each opponent's average ordinal ranking at the end of the season.
quality_win_tracker <-
read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
mutate(win = if_else(AScore > BScore, 1, 0), win_by = AScore - BScore) %>%
select(Season, DayNum, ATeamID, AScore, BTeamID, BScore, win, win_by) %>%
bind_rows(
read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "B"), .cols = starts_with("W")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "A"), .cols = starts_with("L")) %>%
mutate(win = 0, win_by = BScore - AScore) %>%
select(Season, DayNum, ATeamID, AScore, BTeamID, BScore, win, win_by)
) %>%
arrange(Season, DayNum) %>%
group_by(Season, ATeamID) %>%
mutate(AGames = row_number(), AWins = cumsum(win), ALosses = AGames - AWins) %>%
group_by(Season, BTeamID) %>%
mutate(BGames = row_number(), BLosses = cumsum(win), BWins = BGames - BLosses) %>%
ungroup() %>%
mutate(win_perc_A = AWins / AGames, win_perc_B = BWins / BGames) %>%
filter(AGames >= 10 & BGames >= 10) %>%
group_by(Season, ATeamID) %>%
summarise(good_wins = sum(win == 1 & win_perc_B > .6), bad_loss = sum(win == 0 & win_perc_B < .3), wins = max(AWins), losses = max(BLosses)) %>%
rename(TeamID = ATeamID) %>%
ungroup()
quad_win_helper <-
read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "A"), .cols = starts_with("W")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "B"), .cols = starts_with("L")) %>%
mutate(win = if_else(AScore > BScore, 1, 0), win_by = AScore - BScore) %>%
select(Season, DayNum, ATeamID, AScore, BTeamID, BScore, win, win_by) %>%
bind_rows(
read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "W",replacement = "B"), .cols = starts_with("W")) %>%
rename_with(.fn = ~str_replace(string = .,pattern = "L",replacement = "A"), .cols = starts_with("L")) %>%
mutate(win = 0, win_by = BScore - AScore) %>%
select(Season, DayNum, ATeamID, AScore, BTeamID, BScore, win, win_by)
) %>%
left_join(ranking_data %>% select(TeamID, Season,rank_avg_B = rank_avg), by = c("Season", "BTeamID" = "TeamID")) %>%
mutate(game_of_interest_A = rank_avg_B < 30, game_of_interest_A_bad = rank_avg_B > 75)
quad_win_tracker <-
quad_win_helper %>%
count(Season, ATeamID, win, game_of_interest_A) %>%
filter(!is.na(game_of_interest_A)) %>%
filter(game_of_interest_A) %>%
filter(win == 1) %>%
select(Season, TeamID = ATeamID, quad_wins = n) %>%
full_join(
quad_win_helper %>%
count(Season, BTeamID, win, game_of_interest_A_bad) %>%
filter(!is.na(game_of_interest_A_bad)) %>%
filter(game_of_interest_A_bad) %>%
filter(win == 0) %>%
select(Season, TeamID = BTeamID, quad_loss = n)
) %>%
mutate(across(.cols = c(quad_wins, quad_loss), .fns = ~replace_na(.,0)))
head(quad_win_helper)
# A tibble: 6 x 11
Season DayNum ATeamID AScore BTeamID BScore win win_by rank_avg_B
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2003 10 1104 68 1328 62 1 6 6.12
2 2003 10 1272 70 1393 63 1 7 13.1
3 2003 11 1266 73 1437 61 1 12 74.4
4 2003 11 1296 56 1457 50 1 6 192.
5 2003 11 1400 77 1208 71 1 6 12.7
6 2003 11 1458 81 1186 55 1 26 120.
# ... with 2 more variables: game_of_interest_A <lgl>,
# game_of_interest_A_bad <lgl>
Conference Comparison Data
Track wins and losses for each team’s conference against their opponent’s conference.
conf_start <-
read_csv(here("01_data/MRegularSeasonDetailedResults.csv")) %>%
select(Season, WTeamID, LTeamID, WScore, LScore) %>%
left_join(
read_csv(here("01_data/MTeamConferences.csv")), by = c("Season", "WTeamID" = "TeamID")
) %>%
rename(WConf = ConfAbbrev) %>%
left_join(
read_csv(here("01_data/MTeamConferences.csv")), by = c("Season", "LTeamID" = "TeamID")
) %>%
rename(LConf = ConfAbbrev) %>%
count(Season, WConf, LConf) %>%
mutate(n = ifelse(WConf == LConf, 0, 1))
conf_rank <-
conf_start %>%
left_join(conf_start %>% select(WConf, LConf, Season, n), by = c("Season", "LConf" = "WConf", "WConf" = "LConf")) %>%
rename(conf_wins = n.x, conf_loss = n.y) %>%
mutate(conf_loss = replace_na(conf_loss, 0)) %>%
mutate(conf_record = str_c(WConf, "_", LConf)) %>%
select(Season, conf_record, contains("conf_"))
head(conf_rank)
# A tibble: 6 x 4
Season conf_record conf_wins conf_loss
<dbl> <chr> <dbl> <dbl>
1 2003 a_sun_a_sun 0 0
2 2003 a_sun_a_ten 1 0
3 2003 a_sun_aec 1 1
4 2003 a_sun_big_east 1 1
5 2003 a_sun_big_south 1 1
6 2003 a_sun_big_west 1 1
Final Modeling Data
This compiles all the data for building the model in the next step. We used every tournament game since 2003 to build our model. Each factor appears twice in every row of the modeling data, once for each team.
base_builder <-
read_csv(here("01_data/MNCAATourneyDetailedResults.csv")) %>%
select(Season,contains("Team"), contains("Score"), DayNum) %>%
mutate(win_by = WScore - LScore) %>%
relocate(-win_by) %>%
mutate(win = if_else(win_by > 0, "win", "lose")) %>%
select(Season, WTeamID, LTeamID, DayNum, win_by, win)
base_builder <-
base_builder %>%
bind_rows(
base_builder %>% rename(LTeamID = WTeamID, WTeamID = LTeamID) %>%
mutate(win_by = -win_by, win = "lose")
)
staging_data <-
ranking_data %>%
left_join(summary_stats, by = c("Season","TeamID")) %>%
distinct() %>%
left_join(quality_win_tracker) %>%
left_join(
read_csv(here("01_data/MTeamConferences.csv")), by = c("Season", "TeamID")
) %>%
relocate(ConfAbbrev, .after = Season) %>%
rename(conf = ConfAbbrev)
model_data <-
base_builder %>%
left_join(staging_data, by = c("WTeamID" = "TeamID", "Season")) %>%
left_join(staging_data, by = c("LTeamID" = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
mutate(conf_record_one = str_c(conf_A, "_", conf_B)) %>%
mutate(conf_record_two = str_c(conf_B, "_", conf_A)) %>%
left_join(conf_rank, by = c("Season", "conf_record_one" = "conf_record")) %>%
left_join(conf_rank, by = c("Season", "conf_record_two" = "conf_record"), suffix = c("_against_B", "_against_A")) %>%
select(-c(conf_record_one,conf_record_two,conf_A,conf_B)) %>%
mutate(across(.cols = contains("conf_"),.fns = ~replace_na(.,0))) %>%
left_join(rankings, by = c("WTeamID" = "TeamID", "Season")) %>%
left_join(rankings, by = c("LTeamID" = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
left_join(seeds, by = c("WTeamID" = "TeamID", "Season")) %>%
left_join(seeds, by = c("LTeamID" = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
left_join(quad_win_tracker, by = c("WTeamID" = "TeamID", "Season")) %>%
left_join(quad_win_tracker, by = c("LTeamID" = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
group_split(Season < 2015) %>%
set_names(c("Test","Train")) %>%
map(~select(.,-`Season < 2015`))
head(model_data)
$Test
# A tibble: 670 x 94
Season WTeamID LTeamID DayNum win_by win rank_avg_A OE_A DE_A clust_A
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2015 1214 1264 134 10 win 253. 0.928 0.958 9
2 2015 1279 1140 134 4 win 49.2 1.06 0.98 1
3 2015 1173 1129 135 1 win 42.5 1.04 0.915 1
4 2015 1352 1316 135 4 win 173. 0.998 0.981 5
5 2015 1112 1411 136 21 win 4.21 1.12 0.857 1
6 2015 1116 1459 136 3 win 22.3 1.08 0.964 2
7 2015 1139 1400 136 8 win 24.5 1.04 0.91 1
8 2015 1153 1345 136 1 win 45.8 1.01 0.891 5
9 2015 1207 1186 136 10 win 23.2 1.05 0.958 1
10 2015 1209 1124 136 1 win 77.5 1.07 0.939 1
# ... with 660 more rows, and 84 more variables:
# last3offensive_efficiency_A <dbl>, last3possessions_per_game_A <dbl>,
# last3defensive_efficiency_A <dbl>, overallfree_throw_rate_A <dbl>,
# col_A <dbl>, ratingschedule_strength_by_other_A <dbl>,
# overallpercent_of_points_from_3_pointers_A <dbl>, last3_change_A <dbl>,
# t3_week_rank_avg_A <dbl>, Ast_A <dbl>, Blk_A <dbl>, DR_A <dbl>,
# FGA_A <dbl>, FGA3_A <dbl>, FGM_A <dbl>, FGM3_A <dbl>, FTA_A <dbl>,
# FTM_A <dbl>, OR_A <dbl>, PF_A <dbl>, Score_A <dbl>, Stl_A <dbl>,
# TO_A <dbl>, avg_win_A <dbl>, avg_win_by_A <dbl>, good_wins_A <int>,
# bad_loss_A <int>, wins_A <dbl>, losses_A <dbl>, rank_avg_B <dbl>,
# OE_B <dbl>, DE_B <dbl>, clust_B <dbl>, last3offensive_efficiency_B <dbl>,
# last3possessions_per_game_B <dbl>, last3defensive_efficiency_B <dbl>,
# overallfree_throw_rate_B <dbl>, col_B <dbl>,
# ratingschedule_strength_by_other_B <dbl>,
# overallpercent_of_points_from_3_pointers_B <dbl>, last3_change_B <dbl>,
# t3_week_rank_avg_B <dbl>, Ast_B <dbl>, Blk_B <dbl>, DR_B <dbl>,
# FGA_B <dbl>, FGA3_B <dbl>, FGM_B <dbl>, FGM3_B <dbl>, FTA_B <dbl>,
# FTM_B <dbl>, OR_B <dbl>, PF_B <dbl>, Score_B <dbl>, Stl_B <dbl>,
# TO_B <dbl>, avg_win_B <dbl>, avg_win_by_B <dbl>, good_wins_B <int>,
# bad_loss_B <int>, wins_B <dbl>, losses_B <dbl>, conf_wins_against_B <dbl>,
# conf_loss_against_B <dbl>, conf_wins_against_A <dbl>,
# conf_loss_against_A <dbl>, RankingDayNum_A <dbl>, MOR_A <dbl>, POM_A <dbl>,
# RPI_A <dbl>, SAG_A <dbl>, WLK_A <dbl>, RankingDayNum_B <dbl>, MOR_B <dbl>,
# POM_B <dbl>, RPI_B <dbl>, SAG_B <dbl>, WLK_B <dbl>, Seed_A <dbl>,
# Seed_B <dbl>, quad_wins_A <dbl>, quad_loss_A <dbl>, quad_wins_B <dbl>,
# quad_loss_B <dbl>
$Train
# A tibble: 1,560 x 94
Season WTeamID LTeamID DayNum win_by win rank_avg_A OE_A DE_A clust_A
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2003 1421 1411 134 8 win 240. 0.986 1.09 8
2 2003 1112 1436 136 29 win 2.68 1.09 0.891 2
3 2003 1113 1272 136 13 win 36 1.06 0.962 5
4 2003 1141 1166 136 6 win 45.7 1.06 0.994 5
5 2003 1143 1301 136 2 win 36.4 1.03 0.968 1
6 2003 1163 1140 136 5 win 27.5 1.07 0.946 2
7 2003 1181 1161 136 10 win 10.7 1.08 0.928 2
8 2003 1211 1153 136 5 win 43.1 1.08 0.953 1
9 2003 1228 1443 136 5 win 10.5 1.08 0.877 2
10 2003 1242 1429 136 3 win 5.97 1.08 0.872 2
# ... with 1,550 more rows, and 84 more variables:
# last3offensive_efficiency_A <dbl>, last3possessions_per_game_A <dbl>,
# last3defensive_efficiency_A <dbl>, overallfree_throw_rate_A <dbl>,
# col_A <dbl>, ratingschedule_strength_by_other_A <dbl>,
# overallpercent_of_points_from_3_pointers_A <dbl>, last3_change_A <dbl>,
# t3_week_rank_avg_A <dbl>, Ast_A <dbl>, Blk_A <dbl>, DR_A <dbl>,
# FGA_A <dbl>, FGA3_A <dbl>, FGM_A <dbl>, FGM3_A <dbl>, FTA_A <dbl>,
# FTM_A <dbl>, OR_A <dbl>, PF_A <dbl>, Score_A <dbl>, Stl_A <dbl>,
# TO_A <dbl>, avg_win_A <dbl>, avg_win_by_A <dbl>, good_wins_A <int>,
# bad_loss_A <int>, wins_A <dbl>, losses_A <dbl>, rank_avg_B <dbl>,
# OE_B <dbl>, DE_B <dbl>, clust_B <dbl>, last3offensive_efficiency_B <dbl>,
# last3possessions_per_game_B <dbl>, last3defensive_efficiency_B <dbl>,
# overallfree_throw_rate_B <dbl>, col_B <dbl>,
# ratingschedule_strength_by_other_B <dbl>,
# overallpercent_of_points_from_3_pointers_B <dbl>, last3_change_B <dbl>,
# t3_week_rank_avg_B <dbl>, Ast_B <dbl>, Blk_B <dbl>, DR_B <dbl>,
# FGA_B <dbl>, FGA3_B <dbl>, FGM_B <dbl>, FGM3_B <dbl>, FTA_B <dbl>,
# FTM_B <dbl>, OR_B <dbl>, PF_B <dbl>, Score_B <dbl>, Stl_B <dbl>,
# TO_B <dbl>, avg_win_B <dbl>, avg_win_by_B <dbl>, good_wins_B <int>,
# bad_loss_B <int>, wins_B <dbl>, losses_B <dbl>, conf_wins_against_B <dbl>,
# conf_loss_against_B <dbl>, conf_wins_against_A <dbl>,
# conf_loss_against_A <dbl>, RankingDayNum_A <dbl>, MOR_A <dbl>, POM_A <dbl>,
# RPI_A <dbl>, SAG_A <dbl>, WLK_A <dbl>, RankingDayNum_B <dbl>, MOR_B <dbl>,
# POM_B <dbl>, RPI_B <dbl>, SAG_B <dbl>, WLK_B <dbl>, Seed_A <dbl>,
# Seed_B <dbl>, quad_wins_A <dbl>, quad_loss_A <dbl>, quad_wins_B <dbl>,
# quad_loss_B <dbl>
Submission Preparation Data
Builds the data frame of factors for every possible pair of teams in the 2021 tournament, which we will use to make predictions and submit to Kaggle.
stage_2_submission_data <-
read_csv(here("./01_data/MSampleSubmissionStage2.csv")) %>%
select(-Pred) %>%
separate(col = ID, into = c("Season", "WTeamID", "LTeamID"), sep = "_", convert = T) %>%
left_join(staging_data, by = c("WTeamID" = "TeamID", "Season")) %>%
left_join(staging_data, by = c("LTeamID" = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
mutate(conf_record_one = str_c(conf_A, "_", conf_B)) %>%
mutate(conf_record_two = str_c(conf_B, "_", conf_A)) %>%
left_join(conf_rank, by = c("Season", "conf_record_one" = "conf_record")) %>%
left_join(conf_rank, by = c("Season", "conf_record_two" = "conf_record"), suffix = c("_against_B", "_against_A")) %>%
select(-c(conf_record_one,conf_record_two,conf_A,conf_B)) %>%
mutate(across(.cols = contains("conf_"),.fns = ~replace_na(.,0))) %>%
left_join(rankings, by = c("WTeamID" = "TeamID", "Season")) %>%
left_join(rankings, by = c("LTeamID" = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
left_join(seeds, by = c("WTeamID" = "TeamID", "Season")) %>%
left_join(seeds, by = c("LTeamID" = "TeamID", "Season"), suffix = c("_A", "_B")) %>%
left_join(quad_win_tracker, by = c("WTeamID" = "TeamID", "Season")) %>%
left_join(quad_win_tracker, by = c("LTeamID" = "TeamID", "Season"), suffix = c("_A", "_B"))
head(stage_2_submission_data)
# A tibble: 6 x 91
Season WTeamID LTeamID rank_avg_A OE_A DE_A clust_A last3offensive_efficien~
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021 1101 1104 83.5 1.03 0.85 2 1.03
2 2021 1101 1111 83.5 1.03 0.85 2 1.03
3 2021 1101 1116 83.5 1.03 0.85 2 1.03
4 2021 1101 1124 83.5 1.03 0.85 2 1.03
5 2021 1101 1140 83.5 1.03 0.85 2 1.03
6 2021 1101 1155 83.5 1.03 0.85 2 1.03
# ... with 83 more variables: last3possessions_per_game_A <dbl>,
# last3defensive_efficiency_A <dbl>, overallfree_throw_rate_A <dbl>,
# col_A <dbl>, ratingschedule_strength_by_other_A <dbl>,
# overallpercent_of_points_from_3_pointers_A <dbl>, last3_change_A <dbl>,
# t3_week_rank_avg_A <dbl>, Ast_A <dbl>, Blk_A <dbl>, DR_A <dbl>,
# FGA_A <dbl>, FGA3_A <dbl>, FGM_A <dbl>, FGM3_A <dbl>, FTA_A <dbl>,
# FTM_A <dbl>, OR_A <dbl>, PF_A <dbl>, Score_A <dbl>, Stl_A <dbl>,
# TO_A <dbl>, avg_win_A <dbl>, avg_win_by_A <dbl>, good_wins_A <int>,
# bad_loss_A <int>, wins_A <dbl>, losses_A <dbl>, rank_avg_B <dbl>,
# OE_B <dbl>, DE_B <dbl>, clust_B <dbl>, last3offensive_efficiency_B <dbl>,
# last3possessions_per_game_B <dbl>, last3defensive_efficiency_B <dbl>,
# overallfree_throw_rate_B <dbl>, col_B <dbl>,
# ratingschedule_strength_by_other_B <dbl>,
# overallpercent_of_points_from_3_pointers_B <dbl>, last3_change_B <dbl>,
# t3_week_rank_avg_B <dbl>, Ast_B <dbl>, Blk_B <dbl>, DR_B <dbl>,
# FGA_B <dbl>, FGA3_B <dbl>, FGM_B <dbl>, FGM3_B <dbl>, FTA_B <dbl>,
# FTM_B <dbl>, OR_B <dbl>, PF_B <dbl>, Score_B <dbl>, Stl_B <dbl>,
# TO_B <dbl>, avg_win_B <dbl>, avg_win_by_B <dbl>, good_wins_B <int>,
# bad_loss_B <int>, wins_B <dbl>, losses_B <dbl>, conf_wins_against_B <dbl>,
# conf_loss_against_B <dbl>, conf_wins_against_A <dbl>,
# conf_loss_against_A <dbl>, RankingDayNum_A <dbl>, MOR_A <dbl>, POM_A <dbl>,
# RPI_A <dbl>, SAG_A <dbl>, WLK_A <dbl>, RankingDayNum_B <dbl>, MOR_B <dbl>,
# POM_B <dbl>, RPI_B <dbl>, SAG_B <dbl>, WLK_B <dbl>, Seed_A <dbl>,
# Seed_B <dbl>, quad_wins_A <dbl>, quad_loss_A <dbl>, quad_wins_B <dbl>,
# quad_loss_B <dbl>
Workflow
Version Control
We hosted all our code on GitHub and used Git for version control.
Organized Folder Structure
We managed our files in the following folder structure.
- 01_data
- 26 csvs
- 02_scripts
- data_prep.R
- support_functions.R
- execute.R
- 03_submissions
- submission files
- model_registry.csv
We wanted to minimize the "clutter" of our working script. We placed data cleaning and preparation code in data_prep.R, and the support functions we developed for modeling in support_functions.R.
Functional Modeling
We created one main function that would do the following:
- Split data into test/train
- Create a classification or regression model
- Use an elastic net, random forest, or boosted tree
- Accept a user-defined set of factors to create the model
- Tune the modeling parameters over a user-defined grid size
- Output appropriate accuracy metrics (Root Mean Squared Error (RMSE) or accuracy)
- Output a submission file to upload into the Kaggle competition
- Save model performance to the model registry (more details below)
Track Models
We created a model registry to support our model building. When we execute a model, the function first checks the registry to see whether this (often time-consuming) model has already been run.
The registry tracks the following information:
- Model engine (random forest, elastic net, boosted tree)
- Type of model (regression or classification)
- Regression equation (win ~ factor 1 + factor 2 + ... + factor n)
- Tuning grid size (for determining parameters)
- Model performance
- Performance metric (RMSE / Accuracy)
- Special notes
register_model <- function(current_model_data){
registry <- read_csv("model_registry.csv")
model_check <-
current_model_data %>%
inner_join(registry)
if(nrow(model_check) == 0){
message("This is a new model")
} else if(nrow(model_check) == 1){
stop(str_c("This model has been done before with an ", model_check$performance_measure, " of ", round(model_check$model_performance,6)))
}
return(registry)
}
Example output from registry:
# A tibble: 6 x 7
type_of_model mode regression_formula grid_size notes model_performan~
<chr> <chr> <chr> <dbl> <chr> <dbl>
1 random_forest class~ "~winrank_avg_A + ran~ 100 valida~ 0.671
2 random_forest class~ "~winrank_avg_A + ran~ 100 valida~ 0.562
3 random_forest class~ "~winrank_avg_A + ran~ 100 valida~ 0.526
4 random_forest class~ "~winrank_avg_A + ran~ 100 valida~ 0.532
5 random_forest class~ "~winrank_avg_A + ran~ 100 valida~ 0.598
6 random_forest class~ "~winrank_avg_A + ran~ 100 valida~ 0.500
# ... with 1 more variable: performance_measure <chr>
Methodology
- Build model off historical tournaments
- Use end of season data
- Regular season games not used in training data
Functions
Important Libraries
These were the main libraries we used.
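The original library chunk did not render here. Judging from the functions used throughout (`read_csv`, `here()`, `recipe()`, `tune_grid()`, `conflict_prefer()`), the load step was along these lines; this is a reconstruction, not the verbatim script:

```r
library(tidyverse)   # dplyr, tidyr, purrr, readr, stringr pipelines
library(here)        # project-relative file paths, e.g. here("01_data/...")
library(tidymodels)  # parsnip, recipes, workflows, tune, rsample
library(conflicted)  # explicit resolution of function-name conflicts
```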
Set Up Code
This code specified which libraries to use for specific function conflicts.
conflict_prefer(name = "pluck", winner = "purrr", losers = "rvest")
conflict_prefer(name = "filter", winner = "dplyr", losers = "stats")
Workhorse Modeling Function
This is our main function used for modeling.
Inputs
- Training data
- Mode (classification or regression)
- Regression equation (win ~ factor 1 + factor 2 + ... + factor n)
- Grid size
- Prediction data
- Notes
- Submission model
Outputs
- Updated model registry
- Accuracy/RMSE of the model
- File to submit to Kaggle
register_make_execute_evaluate <- function(data = stats,
type_of_model = "random_forest",
mode = "classification",
# mode = "regression",
regression_formula = formula(win ~ ARPI + ABPI + BRPI + BBPI),
# regression_formula = formula(win_by ~ ARPI + ABPI + BRPI + BBPI),
grid_size = 1,
prediction_data = stats_final,
notes = "current best model make into regression",
submission_mode = F
){
current_model_data <- tibble(type_of_model = type_of_model,
mode = mode,
regression_formula = str_c(as.character(regression_formula),collapse = ""),
grid_size = grid_size,
notes = notes)
if(!submission_mode){
reg <- register_model(current_model_data = current_model_data)
}
set.seed(123)
splits <- initial_split(data, strata = Season)
stats_other <- training(splits)
stats_test <- testing(splits)
set.seed(234)
val_set <- validation_split(stats_other,
strata = Season,
prop = 0.80)
cores <- parallel::detectCores()
message(str_c("you are using ", cores, " cores"))
if(type_of_model == "random_forest") {
mod <-
rand_forest(mtry = tune(),min_n = tune(),trees = tune()) %>%
set_engine("ranger", num.threads = cores, keep.inbag = TRUE)
} else if(type_of_model == "log_reg"){
mod <-
logistic_reg(penalty = tune(),mixture = tune()) %>%
set_engine("glmnet", num.threads = cores, keep.inbag = TRUE)
} else if(type_of_model == "boost_tree"){
mod <-
boost_tree(mtry = tune(),trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), sample_size = tune(), stop_iter = 10) %>%
set_engine("xgboost", num.threads = cores, keep.inbag = TRUE)
}
if (mode == "regression") {
mod <-
mod %>%
set_mode("regression")
} else if (mode == "classification") {
mod <-
mod %>%
set_mode("classification")
}
recipe <-
recipe(regression_formula, data = stats_other) %>%
step_interact(terms = ~ conf_wins_against_B /(conf_wins_against_B + conf_loss_against_B)) %>%
step_interact(terms = ~ conf_wins_against_A / (conf_wins_against_A + conf_loss_against_A)) %>%
step_nzv(all_predictors(), - all_outcomes()) %>%
step_knnimpute(all_predictors()) %>%
# step_meanimpute(all_numeric(), - all_outcomes()) %>%
step_center(all_numeric(), - all_outcomes()) %>%
step_scale(all_numeric(), - all_outcomes())
workflow <-
workflow() %>%
add_model(mod) %>%
add_recipe(recipe)
if(mode == "regression"){
metrics <- c("rmse", "rsq", "ccc")
metrics_id <- metric_set(rmse, rsq, ccc)
} else if(mode == "classification"){
metrics <- c("accuracy", "kap")
metrics_id <- metric_set(accuracy, kap)
}
message(str_c("Begin CV to tune parameters with grid size of ", grid_size, " with ", metrics[1]), " on a ", mode, " model.")
set.seed(345)
res <-
workflow %>%
tune_grid(val_set,
grid = grid_size,
control = control_grid(save_pred = TRUE, verbose = T),
metrics = metrics_id)
message(str_c("Complete CV to tune parameters with grid size of ", grid_size))
best <-
res %>%
select_best(metric = metrics[1])
  if (type_of_model == "random_forest") {
    last_mod <-
      rand_forest(mtry = best$mtry, min_n = best$min_n, trees = best$trees) %>%
      set_engine("ranger", num.threads = cores, keep.inbag = TRUE, importance = "impurity")
  } else if (type_of_model == "log_reg") {
    last_mod <-
      logistic_reg(penalty = best$penalty, mixture = best$mixture) %>%
      set_engine("glmnet", num.threads = cores, keep.inbag = TRUE)
  } else if (type_of_model == "boost_tree") {
    last_mod <-
      boost_tree(
        mtry = best$mtry, trees = best$trees, min_n = best$min_n,
        tree_depth = best$tree_depth, learn_rate = best$learn_rate,
        loss_reduction = best$loss_reduction, sample_size = best$sample_size,
        stop_iter = 10
      ) %>%
      set_engine("xgboost", num.threads = cores, keep.inbag = TRUE)
  }

  if (mode == "regression") {
    last_mod <- last_mod %>% set_mode("regression")
  } else if (mode == "classification") {
    last_mod <- last_mod %>% set_mode("classification")
  }
  last_workflow <-
    workflow %>%
    update_model(last_mod)
  set.seed(345)
  last_fit <-
    last_workflow %>%
    last_fit(splits)

  message(str_c("Begin model on entire data"))
  final_model <- fit(last_workflow, data)

  if (mode == "regression") {
    message(str_c("Begin making predictions"))
    data_with_predictions <-
      prediction_data %>%
      bind_cols(predict(final_model, new_data = prediction_data))
    if (!submission_mode) {
      message(str_c("Score Model"))
      # RMSE within each season, then averaged across seasons
      model_performance <-
        data_with_predictions %>%
        group_by(Season) %>%
        rmse(truth = win_by, estimate = .pred) %>%
        summarise(mean_RMSE = mean(.estimate))
      message(str_c("This model has a ", metrics[1], " of ", round(model_performance$mean_RMSE, 2)))
      new_registry <-
        reg %>%
        bind_rows(
          current_model_data %>%
            mutate(model_performance = as.double(model_performance$mean_RMSE),
                   performance_measure = metrics[1])
        )
    }
  } else if (mode == "classification") {
    message(str_c("Begin making predictions"))
    data_with_predictions <-
      prediction_data %>%
      bind_cols(predict(final_model, new_data = prediction_data, type = "prob")) %>%
      bind_cols(predict(final_model, new_data = prediction_data))
    message("post set up data")
    if (!submission_mode) {
      message(str_c("Score Model"))
      # Clip probabilities to [.001, .999] so one confident miss cannot
      # dominate the log loss, then compute the Kaggle score by season
      model_performance <-
        data_with_predictions %>%
        mutate(across(.cols = c(.pred_lose, .pred_win), .fns = ~ if_else(. > .999, .999, .))) %>%
        mutate(across(.cols = c(.pred_lose, .pred_win), .fns = ~ if_else(. < .001, .001, .))) %>%
        mutate(kaggle = if_else(win == "win", -log(.pred_win), -log(.pred_lose))) %>%
        group_by(Season) %>%
        summarise(kaggle = mean(kaggle), accuracy = sum(.pred_class == win) / n()) %>%
        summarise(mean_accuracy = mean(accuracy), mean_kaggle = mean(kaggle))
      message(str_c("This model has a ", metrics[1], " of ", round(model_performance$mean_accuracy, 2),
                    " and Kaggle score of ", round(model_performance$mean_kaggle, 6)))
      new_registry <-
        reg %>%
        bind_rows(
          current_model_data %>%
            mutate(model_performance = as.double(model_performance$mean_kaggle),
                   performance_measure = "Kaggle")
        )
    }
  }
  if (!submission_mode) {
    write_csv(new_registry, "model_registry.csv")
  } else {
    # Shape predictions into Kaggle's submission format: ID = Season_WTeamID_LTeamID
    if (mode == "regression") {
      data_with_predictions <-
        data_with_predictions %>%
        unite(col = ID, sep = "_", Season, WTeamID, LTeamID) %>%
        select(ID, Pred = contains("pred"))
    } else if (mode == "classification") {
      data_with_predictions <-
        data_with_predictions %>%
        unite(col = ID, sep = "_", Season, WTeamID, LTeamID) %>%
        select(ID, Pred = .pred_win)
    }
    # Auto-increment today's submission id from the files already written
    id <-
      tibble(files = list.files("03_submissions")) %>%
      mutate(files = str_remove(files, "\\.csv$")) %>%
      separate(files, sep = "_", c("trash", "trash2", "trash3", "date", "id")) %>%
      select(date, id) %>%
      mutate(date = lubridate::ymd(date), id = as.integer(id)) %>%
      filter(date >= lubridate::today()) %>%
      summarise(id = max(id)) %>%
      mutate(id = ifelse(id == -Inf, 1, id + 1)) %>%
      pull(id)
    write_csv(data_with_predictions,
              str_c("03_submissions/prediction_data_", mode, "_", lubridate::today(), "_", id, ".csv"))
  }
  return(data_with_predictions)
}

Execution
We load the support functions and data before beginning our modeling.
Parallel Processing
We ‘register’ the cores on our computer to support parallel processing during model building, which greatly increases the speed of our code.
[1] 8
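The registration chunk itself is hidden in the slides; a minimal sketch using the doParallel backend (an assumption on our part, since any foreach-compatible backend works with tidymodels) looks like:

```r
# Register a parallel backend so tune_grid() can evaluate candidate
# hyperparameter sets across all available cores
library(doParallel)

cores <- parallel::detectCores()
registerDoParallel(cores = cores)

cores  # 8 on the machine used for the talk
```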
Execute Model
The code below can execute either a classification model to estimate the probability of winning or a regression model to estimate the point spread.
model <- register_make_execute_evaluate(
  data = model_data$Train,
  # data = model_data,                        # for submission runs
  type_of_model = "random_forest",            # or "log_reg", "boost_tree"
  mode = "classification",
  regression_formula = formula(
    win ~ last3defensive_efficiency_B + clust_A + clust_B + avg_win_A + avg_win_B +
      overallfree_throw_rate_A + overallfree_throw_rate_B + col_A + col_B +
      conf_wins_against_A + conf_wins_against_B + conf_loss_against_A + conf_loss_against_B +
      quad_wins_A + quad_wins_B + quad_loss_A + quad_loss_B + Seed_A + Seed_B +
      ratingschedule_strength_by_other_A + ratingschedule_strength_by_other_B +
      t3_week_rank_avg_A + t3_week_rank_avg_B
  ),
  grid_size = 100,
  notes = "building final model",
  prediction_data = model_data$Test,
  # prediction_data = stage_2_submission_data,
  submission_mode = FALSE
  # submission_mode = TRUE
)

Variable Importance
The plot below shows the variable importance based on ‘tree impurity’.
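The plot comes from the fitted workflow; since the ranger engine was set up with importance = "impurity", a plot like it can be produced with the vip package. A self-contained sketch on mtcars (stand-in data, not the tournament features):

```r
library(parsnip)
library(vip)

# Fit a small random forest with impurity-based importance recorded,
# mirroring the engine settings used in the tournament model
rf_fit <-
  rand_forest(trees = 100) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)

# vip() ranks predictors by impurity importance; rf_fit$fit is the
# underlying ranger object
vip(rf_fit$fit, num_features = 5)
```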
Model Estimates
# A tibble: 2 x 10
TeamName rank_avg_A t3_week_rank_avg~ last3offensive_effi~ overallfree_throw_~
<chr> <dbl> <dbl> <dbl> <dbl>
1 Ohio 83.4 113. 1.09 21.6
2 Virginia 17.8 22.6 1.04 17.1
last3defensive_effic~ ratingschedule_streng~ last3possessions_pe~ Seed_A col_A
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1.01 0.1 74.2 13 79
2 0.964 10.2 61.2 4 20
# A tibble: 4 x 10
TeamName rank_avg_A t3_week_rank_avg~ last3offensive_eff~ overallfree_throw~
<chr> <dbl> <dbl> <dbl> <dbl>
1 Arkansas 13.3 13.8 1.14 24.2
2 Florida 41.5 29.2 0.975 23.7
3 Ohio St 10.7 10.8 0.949 27.6
4 Oral Robe~ 160. 178. 1.08 20.9
last3defensive_effic~ ratingschedule_streng~ last3possessions_pe~ Seed_A col_A
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.96 9.2 79.1 3 11
2 1.02 10.7 66.6 7 50
3 1.07 12.7 67.4 2 10
4 1.00 -1.8 74.6 15 181
Results
Expected Celebration
Actual Celebration
Final Thoughts
Challenges
- Interactions of factors
- Coding the regression equation was clunky
- “Team A” vs. “Team B”
- Unbalanced output: the model’s P(A beats B) != 1 - P(B beats A)
- Four days “flash to bang”
- Day jobs - we had to do real work!
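The Kaggle metric underlying all of this is the mean negative log of the predicted probability of the actual outcome; our scoring code clips probabilities to [.001, .999] before taking logs so one confident miss cannot dominate. A base-R sketch of that metric (the function name is ours, for illustration):

```r
# Kaggle log loss: mean of -log(p) for the probability assigned to the
# outcome that actually happened, with clipping to [.001, .999]
kaggle_log_loss <- function(pred_win, won, eps = 0.001) {
  p <- pmin(pmax(pred_win, eps), 1 - eps)  # clip extreme probabilities
  mean(ifelse(won, -log(p), -log(1 - p)))
}

# A confidently wrong 1.0 is clipped to .999, costing -log(.001) ~ 6.91
kaggle_log_loss(pred_win = c(0.9, 0.4, 1.0), won = c(TRUE, TRUE, FALSE))
# ≈ 2.643
```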
Ideas for Next Year
- Explore how to incorporate regular season data (or at least conference championships) – We are throwing too much data away
- Functionalize automation of variable selection
- Explore team structure (strong front court vs. back court), star player, etc.
- Network science application
What Didn’t Work
- Bayesian model too computationally intensive
- AutoML
What Helped us be Successful
- Upsets (and the right ones)
- Git
- Functions
- Model registry
- Tidymodels
- ESPN
Takeaways
- This is for fun - and we had fun
- Our workflow can be used broadly (we adapted it from COVID modeling)
- We should be ready to pick up where we left off next year
Links
Github Repository
Github: https://github.com/dusty-turner/ncaa_tournament_2021_beat_navy
git clone git@github.com:dusty-turner/ncaa_tournament_2021_beat_navy.git
QR Code - in development